Professor Jessi Cisewski Kehe contributed to these lecture notes.
These notes introduce the Lake Mendota case study and the ggplot2 package for making plots
We use data from the Lake Mendota freeze/thaw data set to introduce several types of plots
Read the chapter on Lake Mendota from the Course Notes and Case Studies for background.
The R Markdown file for this case study contains all of the code needed to replicate the analysis from lecture.
However, lecture will skip quickly over the blocks of code that are not directly doing visualization.
year1 is the first year of the given
winter.duration is the total number of days that
Lake Mendota was closed by ice in the given winter. (Note: the word
“winter” used here specifies a single season spanning two consecutive
calendar years and does not use the endpoints from the winter solstice
to the spring equinox, for example.)duration.duration and
year1.COURSE represent your course directory.COURSE/lectures.COURSE/lecture/unit2-ggplot2/COURSE/data/week2-ggplot2.Rmd into the
unit2-ggplot2 sub-folder.lake-mendota-winters-2023.csv into
the COURSE/data/ folder.The following R chunk has one line of code that will read in the data set.
Along with the code I added many comments to explain what is happening.
Each code chunk begins with three consecutive back ticks at the start of a line.
The left brace { followed by the letter
r means that the code should be interpreted as r
code.
The name read-data is the name of the
chunk.
Naming chunks is helpful when trying to find errors.
Other knitr options could be set before the right brace
}, but are not in this example.
Note that each R chunk in a file requires a unique name.
We will examine read_csv() and other functions to
read in data later in the semester.
All characters from the first # to the end of the
line are comments
Note: It is okay to just execute this code without worrying about the details. We will go over reading in data in more detail in a later unit of the course.
## `mendota` is a new object.
## The `=` assigns the loaded data set to the object name `mendota`.
## You will often see `<-' instead as the assignment operator.
## I use `=` almost all the time, but `<-` is more common.
## read_csv() is a function in the tidyverse for reading in .csv files.
## There is a base R function named read.csv(). Use read_csv() instead.
## If you use read.csv(), character variables get changed to factors
## and variable names might get changed.
## The argument to read_csv() is a path to the file with the data
## The '..' means go up a directory
## Use a '/' after a directory
## The result of the code below is to create a data frame named mendota
## by reading in the data in the file.
### COURSE/data/ contains the data file
### COURSE/lectures/unit2-ggplot2/ is your working directory
mendota = read_csv("../../data/lake-mendota-winters-2023.csv")
read_csv(), a summary is
printed to the output which:
chr)dbl, which is short for
“double”, which means numerical data stored in the computer as a double
precision floating point number)read.csv() also reads in
the data, but is not as nice as read_csv().
read.csv() does not recognize the columns
with dates as such and treats them as character vectors which will lead
to problems when we want to work with them as dates.read.csv(), break the habit and use read_csv()
instead.After reading in a data set, I often examine it by clicking on its name in the “Environment” tab in the upper right panel.
The data will then be viewable in a tab in the upper left panel.
There are other commands to examine the structure of the data more carefully.
The command spec() is part of the tidyverse and
displays how each variable is specified after reading in.
spec() can be used to copy, paste, and
edit to ensure each column’s data type is read in correctly.Note: once you have modified a data set,
spec() may no longer work.
spec(mendota)
## cols(
## winter = col_character(),
## year1 = col_double(),
## intervals = col_double(),
## duration = col_double(),
## first_freeze = col_date(format = ""),
## last_thaw = col_date(format = ""),
## decade = col_double()
## )
ggplot() which takes the name of a data
frame as its first argument.geom_ followed by a descriptive
name.+ sign must be at the end of each line,
not the beginning of the line.+ at the end.)Creating plots in R is typically an iterative process. Begin with a basic plot and see what it looks like. Then, refine the plot by building on additional layers. The code for complete plots that we show almost always are the result of this process of refinement. The following notes demonstrate this iteration.
ggplot(data = mendota, mapping = aes(x = duration)) +
geom_histogram()
ggplot() very often and will soon memorize
that the first argument is data and the second is
mapping
ggplot(mendota, aes(x = duration)) +
geom_histogram()
+
sign++ meaning
more is to come.ggplot() is always used to create a new
plot.
ggplot() is named
data and is set to be the name of the data set which
contains the data to be plottedggplot() is named
mapping, which describe a mapping between
variables in the data set and aesthetics of the graph used to
display the corresponding variable.x or
y axes, a color, a shape, a size, or more.duration and
it is displayed in the plot by locations on the x axis by setting
mapping to be aes(x = duration).mapping is always set using the function
aes().geom_histogram() which
determines which “geometry” is used to display the corresponding
variables.
geom_histogram() creates a histogramggtitle(), xlab(), and
ylab()binwidth argumentboundary
argumentfill
argumentcolor
argumentfill, color,
binwidth, and boundary here are not
contained within a call to aes().
ggplot(mendota, aes(x = duration)) +
geom_histogram(fill = "hotpink",
color = "black",
binwidth = 7,
boundary = 0) +
xlab("Total days frozen") +
ylab("Counts") +
ggtitle("Lake Mendota Freeze Durations, 1855-2023")
binwidth = 7 and
boundary = 0 determine which intervals will be used in the
histogram.
boundary = 7 had been specified, the bins
would be exactly the same: pick one boundary and
one bin width to determine the end points of
all bins.boundary = 10 had been specified, this would
have shifted all of the bin boundaries three to the right and the graph
would have different bins and different counts.center of one of the intervals instead of the
boundary between two adjacent ones.binwidth = 7
still.
ggplot(mendota, aes(x = duration)) +
geom_histogram(fill = "hotpink",
color = "black",
binwidth = 7,
center = 0) +
xlab("Total days frozen") +
ylab("Counts") +
ggtitle("Lake Mendota Freeze Durations, 1855-2023")
ggplot()
carry over to all additional layers (some of which will ignore
them)data or
mapping arguments in the command to create the layer.ggplot(mendota, aes(x = duration)) +
geom_histogram(center = 70, binwidth = 60,
fill='hotpink', color='black') +
xlab("Total Days Frozen") +
ggtitle("Lake Mendota Freeze Durations, 1855-2023",
subtitle = "(bad plot: too few bins)")
ggplot(mendota, aes(x=duration)) +
geom_histogram(center=70, binwidth=1,
fill='hotpink', color='black') +
xlab("Total Days Frozen") +
ggtitle("Lake Mendota Freeze Durations, 1855-2023",
subtitle = "(bad plot: too many bins)")
ggplot(mendota, aes(x=duration)) +
geom_histogram(center=100, binwidth=10,
fill='hotpink', color='black') +
xlab("Total Days Frozen") +
ggtitle("Lake Mendota Freeze Durations, 1855-2023",
subtitle = "(a decent plot: binwidth = 10)")
There is rarely a single best choice of bins in a histogram, but there are many ways to make a poor choice.
geom_density().ggplot(mendota) +
geom_density(aes(x=duration),
fill="hotpink",
color="black") +
xlab("Total days frozen") +
ylab("Density") +
ggtitle("Lake Mendota Freeze Durations, 1855-2023")
ggplot(mendota) +
geom_density(aes(x=duration),
fill="hotpink",
color="black") +
geom_vline(xintercept = 80, linetype = "dashed") +
xlab("Total days frozen") +
ylab("Density") +
ggtitle("Lake Mendota Freeze Durations, 1855-2023")
We can count the actual number of duration values
which are 80 or below.
In fact, exactly 17 of the 167 durations from the winters from 1855-56 to 2021-22, or about 10%, are 80 days or fewer.
A boxplot is a graphical depiction of some key quantiles of a distribution of numerical values.
Typical boxplots often also identify potential outliers which are extreme values far enough below the first quartile or above the third quartile.
This code chunk creates a vertical boxplot of
duration by mapping it to the aesthetic
y.
See what happens if we use x instead?
ggplot(mendota, aes(y = duration)) +
geom_boxplot()
The box outlines the middle 50% of the observations
The lower vertical line (i.e., the lower whisker) reaches down to the minimum value of the observations, but will not drop below Q1 - 1.5 x IQR, where IQR = Q3 - Q1 is the inter-quartile range.
Similarly, the higher vertical line (i.e., the upper whisker) reaches up to the maximum value of the observations, but will not go above Q3 + 1.5 x IQR.
Any points plotted below or above the vertical lines indicate observations that are below or above the noted ranges. These points may be considered potential outliers.
The number line on the x axis is meaningless and arbitrary.
duration
is mapped from y to x.ggplot(mendota, aes(x = duration)) +
geom_boxplot()
decade partitions each year
into the decade associated with its first year.x variable is set to
as.character(decade) so that decade is treated
as categorical and not numerical.ggplot(mendota, aes(x = as.character(decade), y = duration)) +
geom_boxplot(fill = "papayawhip",
color = "blue") +
ylab("Total days closed by ice") +
xlab("Decade") +
ggtitle("Lake Mendota Freeze Durations, 1855-2023")
How do the total number of days that Lake Mendota is closed with ice vary over time?
year1 and
turned it into a categorical variable decade, collecting
the observations by decade, and plotted the data with side-by-side box
plots.year1 and
duration directly with an augmented scatter plot.year1
specifying the location on the x axis and duration the
location on the y axis.ggplot(mendota, aes(x = year1, y = duration)) +
geom_point()
geom_point()) to line segments
which connect consecutive points (using geom_line()).ggplot(mendota, aes(x = year1, y = duration)) +
geom_line()
geom_smooth(), greatly reduces the
noise and replaces the actual data values with a smooth curve.ggplot(mendota, aes(x = year1, y = duration)) +
geom_smooth()
geom_smooth() draws a
ribbon to represent uncertainty in the true value of the estimated trend
line.se
(short for standard error which we will see later in the
course) to be FALSE.ggplot(mendota, aes(x = year1, y = duration)) +
geom_smooth(se = FALSE)
## note setting the aesthetics here in ggplot()
## otherwise, we would need to set the in geom_point(), geom_line(), and geom_smooth() each separately
ggplot(mendota, aes(x=year1, y=duration)) +
geom_point() +
geom_line() +
geom_smooth(se=FALSE) +
xlab("Year") +
ylab("Total Days Frozen") +
ggtitle("Lake Mendota Freeze Durations, 1855-2023")
In a typical year, how much does the actual duration for which Lake Mendota is frozen (the surface is at least 50% covered by ice) vary from its predicted value? (How far apart are the black points and the blue curve, typically, in a given year)?
To address this question, we need to augment the data set by adding a numerical variable which includes this calculation.
A block of code we suppress in these notes accomplishes this to
create a variable named residual which we can
examine.
Here are the first few rows of the augmented data set.
width = Inf argument makes sure that all variables
are included in the printed output.print(mendota, width = Inf)
## # A tibble: 168 × 9
## winter year1 intervals duration first_freeze last_thaw decade fitted
## <chr> <dbl> <dbl> <dbl> <date> <date> <dbl> <dbl>
## 1 1855-56 1855 1 118 1855-12-18 1856-04-14 1850 125.
## 2 1856-57 1856 1 151 1856-12-06 1857-05-06 1850 124.
## 3 1857-58 1857 1 121 1857-11-25 1858-03-26 1850 124.
## 4 1858-59 1858 1 96 1858-12-08 1859-03-14 1850 123.
## 5 1859-60 1859 1 110 1859-12-07 1860-03-26 1850 123.
## 6 1860-61 1860 1 117 1860-12-14 1861-04-10 1860 122.
## 7 1861-62 1861 1 132 1861-12-02 1862-04-13 1860 121.
## 8 1862-63 1862 1 104 1862-12-26 1863-04-09 1860 121.
## 9 1863-64 1863 1 125 1863-12-18 1864-04-21 1860 120.
## 10 1864-65 1864 1 118 1864-12-08 1865-04-05 1860 120.
## residual
## <dbl>
## 1 -7.00
## 2 26.6
## 3 -2.77
## 4 -27.2
## 5 -12.6
## 6 -4.99
## 7 10.6
## 8 -16.8
## 9 4.72
## 10 -1.73
## # ℹ 158 more rows
In the first recorded winter of 1855–56, Lake Mendota was closed by ice for 118 days, but the trend curve in the same winter predicted 125 days. The residual is the difference, \(-7 = 118 - 125\).
Let’s create a density plot to visualize the distribution of all of the residuals.
ggplot(mendota, aes(x = residual)) +
geom_density(fill = "papayawhip")
It is more common for a winter where the actual duration for which Lake Mendota is closed by ice is extremely lower than typical than a winter where it is abnormally longer than typical.
The scale of the plot indicates that it is not at all unusual for the actual duration Lake Mendota is closed by ice to differ from the trend line by about 2 to 3 weeks in either direction.
Most of the area under the curve is between about \(-21\) days and \(+21\) days, but there are a fair number of years which exceed these typical values.
We could also examine the distribution of the residuals around the smooth curve with a box plot.
ggplot(mendota, aes(x = residual)) +
geom_boxplot()
When we learn about dplyr next week, we will develop some skills for summarizing data.
A common way to summarize a collection of numbers is the mean, or the sum divided by the number of values.
\[ \bar{x} = \frac{\sum_{x=1}^n x_i}{n} \]
\[ s = \sqrt{ \frac{\sum_{x=1}^n (x_i - \bar{x})^2}{n-1} } \]
The standard deviation is almost the square root of the mean squared deviations from the mean (almost because we divide by \(n-1\) and not \(n\)).
The standard deviation may often be interpreted as about the size of a typical deviation of an individual value from the mean.
A quantile is a value which indicates a location in a distribution by indicating the proportion of values below (and above).
## # A tibble: 1 × 8
## mean median sd q10 q25 q50 q75 q90
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.0931 2.02 16.7 -20.4 -8.38 2.02 10.7 19.6
ggplot(mendota, aes(x = residual)) +
geom_density(fill = "papayawhip") +
geom_hline(yintercept = 0) +
geom_vline(xintercept = c(-1,1)*16.7,
color = "red", linetype = "dotted") +
geom_vline(xintercept = c(-1,1)*2*16.7,
color = "red", linetype = "dashed") +
geom_vline(xintercept = c(-1,1)*3*16.7,
color = "red", linetype = "solid")
xlab() and ylab() to modify the displayed axis
labels in a plot.ggtitle() to add a title
to a plot.xlab() and ylab() are
almost always just a simple text string, such as we have seen."".ggtitle() are also just
a simple string as an argument.subtitle to
add a second line to the displayed title in a smaller font.
ggplot(mendota, aes(x = residual)) +
geom_density(fill = "papayawhip") +
xlab("Residuals (days)") +
ylab("") +
ggtitle("Density Plot of Residuals from geom_smooth()",
subtitle = "Lake Mendota Freeze data, 1855-2023")
When in the winter does Lake Mendota typically close with ice the first time?
To answer this question graphically, I decided to create a new
variable ff_period which categorizes the date of the first
freeze into a half-month time period, from “Nov 16-30” through “Jan
16-31”
The variable ff_after_june_30 counts the number of
days after June 30 the Lake Mendota first freezes in a given winter.
A suppressed block of code adds the variables
ff_after_june_30 and ff_period to the
mendota data set.
A new data frame mendota_sum contains counts of the
number of winters in each period.
Here are the first few rows of selected columns of each data set.
## # A tibble: 5 × 4
## winter duration first_freeze ff_period
## <chr> <dbl> <date> <fct>
## 1 1855-56 118 1855-12-18 Dec 16-31
## 2 1856-57 151 1856-12-06 Dec 1-15
## 3 1857-58 121 1857-11-25 Nov 16-30
## 4 1858-59 96 1858-12-08 Dec 1-15
## 5 1859-60 110 1859-12-07 Dec 1-15
## # A tibble: 5 × 2
## ff_period n
## <fct> <int>
## 1 Nov 16-30 4
## 2 Dec 1-15 47
## 3 Dec 16-31 81
## 4 Jan 1-15 34
## 5 Jan 16-30 2
geom_bar() is used
to create bar graphs from a data set where the heights of the bars need
t be calculated.geom_col() is used when there is one column for the
categorical variable and another column which contains the heights of
the bars.ff_period
variable of the mendota data set.
ggplot(mendota, aes(x = ff_period)) +
geom_bar()
ggplot(mendota, aes(x = ff_period)) +
geom_bar(color = "black", fill = "blue") +
xlab("Date of the First Freeze") +
ylab("Count") +
ggtitle("Lake Mendota First Freeze Dates",
subtitle = "1855-2023")
mendota_sum
## # A tibble: 5 × 2
## ff_period n
## <fct> <int>
## 1 Nov 16-30 4
## 2 Dec 1-15 47
## 3 Dec 16-31 81
## 4 Jan 1-15 34
## 5 Jan 16-30 2
geom_col() and set both x and
y aesthetics.ggplot(mendota_sum, aes(x = ff_period, y = n)) +
geom_col()
ggplot(mendota_sum, aes(x = ff_period, y = n)) +
geom_col(color = "black", fill = "blue") +
xlab("Date of the First Freeze") +
ylab("Count") +
ggtitle("Lake Mendota First Freeze Dates",
subtitle = "1855-2023")
period50 to the mendota data set which divides
the years into 50-year periods.facet_wrap() partitions the data by one or
more categorical variablesvars() to list the variable on which
to facet.
~period50
to specify the faceting variable.ggplot(mendota, aes(x = ff_period)) +
geom_bar(color = "black", fill = "blue") +
xlab("Date of the First Freeze") +
ylab("Count") +
ggtitle("Lake Mendota First Freeze Dates",
subtitle = "1855-2023") +
facet_wrap(vars(period50))
facet_wrap()facet_grid() is used when you want to
organize the rows and/or columns of the facets by one or more
variables.period50.ggplot(mendota, aes(x = ff_period)) +
geom_bar(color = "black", fill = "blue") +
xlab("Date of the First Freeze") +
ylab("Count") +
ggtitle("Lake Mendota First Freeze Dates",
subtitle = "1855-2023") +
facet_grid(rows = vars(period50))
period50.ggplot(mendota, aes(x = ff_period)) +
geom_bar(color = "black", fill = "blue") +
xlab("Date of the First Freeze") +
ylab("Count") +
ggtitle("Lake Mendota First Freeze Dates",
subtitle = "1855-2023") +
facet_grid(cols = vars(period50))
theme() controls many features of a
plot.axis.text.x
using the function element_text().
angle = 45 rotates the text 45 degreeshjust = 1 justifies the text to the right (0 would be
left, 0.5 in the middle)vjust = 1 justifies the text to the top (0 would be the
bottom, 0.5 in the middle).theme().ggplot(mendota, aes(x = ff_period)) +
geom_bar(color = "black", fill = "blue") +
xlab("Date of the First Freeze") +
ylab("Count") +
ggtitle("Lake Mendota First Freeze Dates",
subtitle = "1855-2023") +
facet_grid(cols = vars(period50)) +
theme(axis.text.x = element_text(
angle = 45, hjust = 1, vjust = 1))
For a single variable, facet_wrap() usually is
preferable.
For a different data set where breaking up by two different
categorical variables, facet_grid() can be very
useful.
You can see more examples at the tidyverse
reference page for facet_grid()
mendota
data set which we name mendota_sum2 by counting the number
of observations within each 50-year period and first freeze 2-week
category in a column n
p is the proportion of these counts within
each 50-year periodpct are these values as percentages## # A tibble: 15 × 5
## # Groups: period50 [4]
## period50 ff_period n p pct
## <chr> <fct> <int> <dbl> <dbl>
## 1 1855-1900 Nov 16-30 4 0.0870 8.70
## 2 1855-1900 Dec 1-15 17 0.370 37.0
## 3 1855-1900 Dec 16-31 19 0.413 41.3
## 4 1855-1900 Jan 1-15 6 0.130 13.0
## 5 1901-1950 Dec 1-15 14 0.28 28
## 6 1901-1950 Dec 16-31 27 0.54 54
## 7 1901-1950 Jan 1-15 8 0.16 16
## 8 1901-1950 Jan 16-30 1 0.02 2
## 9 1951-2000 Dec 1-15 14 0.28 28
## 10 1951-2000 Dec 16-31 27 0.54 54
## 11 1951-2000 Jan 1-15 9 0.18 18
## 12 2001-2023 Dec 1-15 2 0.0909 9.09
## 13 2001-2023 Dec 16-31 8 0.364 36.4
## 14 2001-2023 Jan 1-15 11 0.5 50
## 15 2001-2023 Jan 16-30 1 0.0455 4.55
geom_bar() from the winter-level data as
aboveggplot(mendota, aes(x = ff_period)) +
geom_bar(color = "black", fill = "blue") +
xlab("Date of the First Freeze") +
ggtitle("Lake Mendota First Freeze Dates",
subtitle = "1855-2023") +
facet_wrap(vars(period50))
pct on the y axisggplot(mendota_sum2, aes(x = ff_period, y = pct)) +
geom_col(color = "black", fill = "blue") +
xlab("Date of the First Freeze") +
ylab("Percentage") +
ggtitle("Lake Mendota First Freeze Dates",
subtitle = "1855-2023") +
facet_wrap(vars(period50))
scale_y_continuous() to change the labels on the y axis to
percentagesy = p), but that changing the labels to percentages
adjusts by multiplying by 100 automatically.ggplot(mendota_sum2, aes(x = ff_period, y = p)) +
geom_col(color = "black", fill = "blue") +
scale_y_continuous(labels = scales::percent) +
xlab("Date of the First Freeze") +
ylab("") +
ggtitle("Lake Mendota First Freeze Dates",
subtitle = "1855-2023") +
facet_wrap(vars(period50))
scale_*() commands include
scale_y_discrete(), scale_x_continuous(),
scale_x_discrete().ff_after_june_30
values (number of days after June 30) with the actual dates.ggplot() object and then
reusing it
## ff_after_june_30 directly
g = ggplot(mendota, aes(x = ff_after_june_30)) +
geom_density() +
geom_hline(yintercept = 0) +
xlab("Days after June 30 of the First Freeze") +
ggtitle("Lake Mendota First Freeze Dates",
subtitle = "1855-2023")
g
## relabeled axes
## lubridate commands (ignore for now!)
dec01 = as.numeric(ymd("2023-12-01") - ymd("2023-06-30"))
dec15 = as.numeric(ymd("2023-12-15") - ymd("2023-06-30"))
jan01 = as.numeric(ymd("2024-01-01") - ymd("2024-06-30"))
jan15 = as.numeric(ymd("2024-01-15") - ymd("2024-06-30"))
g +
scale_x_continuous(breaks = c(dec01, dec15, jan01, jan15),
labels = c("Dec 1", "Dec 15", "Jan 1", "Jan 15")) +
xlab("Date of the First Freeze")
scale_*() commands for other aesthetics
such as color, fill, and size.g = ggplot(mendota, aes(x = year1, y = duration)) +
geom_line() +
geom_point(aes(col = ff_period), size = 2) +
xlab("Year") +
ylab("Total Days Closed by Ice") +
ggtitle("Lake Mendota Freeze Durations over Time",
subtitle = "1855-2023")
g
ggplot2 is not friendly to
color blind people.
g = g +
scale_color_viridis_d(direction = -1)
g
ff_period. A better choice for presentation would be
something like “First Freeze Time”.guides() allows many customization of the
legends (guides) which accompany plots.g = g +
guides(color = guide_legend(title = "First Freeze Time"))
g
g + theme_bw()
g + theme_minimal()
The tidyverse list of complete themes are part of the tidyverse online reference.
More examples on modifying themes are in the tidyverse reference guides page
ggplot()
ggplot() will specify:
data is the first argument), andaes()
(mapping is the second argument)x, y,
color, size, and shape which vary
according to the values of variables are mapped to these variables with
the function aes()
ggplot() carry over to all subsequent
layersgeom_histogram()geom_density()geom_boxplot()geom_point()geom_line()geom_bar() when the function does the
countinggeom_col() when you supply the bar
heightsfacet_wrap() to display a sequence of facets in
rows, letting the computer decide where to break the rowsfacet_grid() when you want to decide which
variables vary by rows and columns in the displaysviridis or other color sales regularly
instead of default valuesscale_y_continuous(),
scale_x_discrete() and scale_color_discrete()
(among many others) can maodify the defaults for how aesthetics are
displayed in a plot.guides() function to control the guides for
aesthetics in a plot
theme_bw() changes to a black and white themetheme_minimal() removes the boundary box around the
plot and also uses a black and white theme